Developing Asian language corpora: standards and practice

نویسندگان

  • Zhonghua Xiao
  • Tony McEnery
  • Paul Baker
  • Andrew Hardie
چکیده

This paper first discusses standards for developing Asian language corpora so as to facilitate international data exchange. Following this, we present two corpora of Asian languages developed at Lancaster University – the EMILLE Corpus, which contains 14 South Asian languages, and the Lancaster Corpus of Mandarin Chinese. Finally, we will demonstrate how to explore these corpora using Xara and other corpus tools.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Standards & best practice for multilingual computational lexicons: ISLE MILE and more

ISLE (International Standards for Language Engineering) is a transatlantic standards oriented initiative under the Human Language Technology (HLT) programme within the EU-US International Research Co-operation. It is a continuation of the European EAGLES (Expert Advisory Group for Language Engineering Standards) initiative, carried out through a number of subsequent projects funded by the Europ...

متن کامل

Vocabulary Lists for EAP and Conversation Students

Despite the abundance of research investigating general and academic vocabularies and developing dozens of word lists, few studies have compared academic vocabulary with general service word lists such as conversation vocabulary. Many EAP researchers assume that university students need to know all the words in West’s (1953) General Service List (GSL) as a prerequisite to academic words (e.g., ...

متن کامل

Assessing frequency changes in multistage diachronic corpora: Applications for historical corpus linguistics and the study of language acquisition

The use of corpora that are divided into temporally ordered stages is becoming increasingly wide-spread in historical corpus linguistics. This development is partly due to the fact that more and more resources of this kind are being developed. Since the assessment of frequency changes over multiple periods of time is a relatively recent practice, there are few agreed-upon standards of how such ...

متن کامل

Metadiscourse Use in Popular and Professional Science: The Case of Hedges and Boosters

The present article shows that all scientific texts included in journals, magazines, and newspapers are vulnerable to the penetration of hedges and boosters.  However, it was found that scientific texts in the three corpora tended to open up the possibilities of alternative voices rather than narrowing them down. The relatively higher frequency of occurrence of hedges in comparison with booster...

متن کامل

Query Language for Access to Speech Corpora

With more and more speech corpora at hand the unit selection technique is a promising approach in concatenative speech synthesis. What is missing are models of optimal parameters that sufficiently describe utterances to be produced and their corresponding counterparts in collections of speech data. Prior to this, existing corpora have to be annotated on possibly relevant linguistic and signal l...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2004